A Normalizer for UGC in Brazilian Portuguese

Authors

  • Magali Sanches Duran
  • Maria das Graças Volpe Nunes
  • Lucas Avanço
Abstract

User-generated content (UGC) is an important source of information for governments, companies, political candidates and consumers. However, most Natural Language Processing (NLP) tools and techniques are developed from and for standard-language text, whereas UGC is especially rich in creativity and idiosyncrasies, which act as noise for NLP purposes. This paper presents UGCNormal, a lexicon-based tool for UGC normalization. It comprises a tokenizer, a sentence segmentation tool, a phonetic-based speller and several lexicons, all of which originated from a detailed analysis of a corpus of product reviews in Brazilian Portuguese. Evaluated on two different data sets, the normalizer carried out from 31% to 89% of the appropriate corrections, depending on the type of text noise. UGCNormal was also validated on two downstream tasks: POS tagging, where accuracy improved from 91.35% to 93.15%, and opinion classification, where the average of the positive and negative F1-scores improved from 0.736 to 0.758.
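As a rough illustration of the kind of pipeline the abstract describes (tokenizer, lexicon lookup, phonetic-based speller), the following Python sketch may help. It is not UGCNormal's implementation: the toy lexicons, the phonetic rules and the function names `tokenize`, `phonetic_key` and `normalize` are all assumptions made for this example.

```python
import re
import unicodedata

# Hypothetical toy lexicons; the real UGCNormal lexicons are not reproduced here.
INTERNET_SLANG = {"vc": "você", "pq": "porque", "tb": "também"}
STANDARD_WORDS = {"você", "porque", "também", "casa", "exceção",
                  "produto", "muito", "bom"}


def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def phonetic_key(word):
    """Map a word to a crude sound-alike key (illustrative rules only)."""
    w = unicodedata.normalize("NFC", word.lower())
    w = w.replace("ç", "s")                      # "ç" sounds like "s"
    w = unicodedata.normalize("NFD", w)          # split letters from accents
    w = "".join(c for c in w if unicodedata.category(c) != "Mn")
    w = w.replace("ss", "s").replace("ch", "x")  # common homophone spellings
    return re.sub(r"(.)\1+", r"\1", w)           # collapse repeated letters


# Index of standard words by phonetic key (assumes no key collisions here).
PHONETIC_INDEX = {phonetic_key(w): w for w in STANDARD_WORDS}


def normalize(text):
    """Return the tokens of `text` with noisy forms replaced where possible."""
    result = []
    for token in tokenize(text):
        lower = token.lower()
        if lower in STANDARD_WORDS:
            result.append(token)                  # already standard
        elif lower in INTERNET_SLANG:
            result.append(INTERNET_SLANG[lower])  # lexicon lookup
        else:
            # Fall back to the phonetic index; keep the token if unknown.
            result.append(PHONETIC_INDEX.get(phonetic_key(token), token))
    return result


if __name__ == "__main__":
    print(normalize("vc achou o produto muuuito bom, sem excessão"))
```

In this sketch, lexicon lookup runs before the phonetic fallback so that known abbreviations are never rewritten by a sound-alike match. A real speller would also need to rank several candidates per key, which this example avoids by keeping the word list free of collisions.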

Similar resources

‘Minor’ Languages, ‘Broken’ Translations: On Brazilian Reworkings of an Albanian Novel

This essay approaches the challenges of global translation in the 21st century from what might still be considered a somewhat uncommon example: a direct translation of Ismail Kadaré's 1978 novel Prill e thyër (Broken April) from the original Albanian into Brazilian Portuguese in 2001. Not only does it examine and compare lexical elements in the source and target texts and the usage of translato...

Exploring Word Embeddings for Unsupervised Textual User-Generated Content Normalization

Text normalization techniques based on rules, lexicons or supervised training requiring large corpora are neither scalable nor domain-interchangeable, and this makes them unsuitable for normalizing user-generated content (UGC). Current tools available for Brazilian Portuguese make use of such techniques. In this work we propose a technique based on distributed representation of words (or word embed...

DIXI - Portuguese text-to-speech system

This paper describes the software architecture of the Portuguese text-to-speech system DIXI. The system has three major modules. The first one contains the text normalizer and searches each word in the lexicon. The second one is a multi-level rule-based module for lexical stress assignment, orthographic to phonetic transcription, metrically based prosodic patterning and for generating the evoluti...

Cyclic Orbit Codes with the Normalizer of a Singer Subgroup

An algebraic construction for constant dimension subspace codes is called an orbit code. It arises as the orbits under the action of a subgroup of the general linear group on subspaces in an ambient space. In particular, orbit codes of a Singer subgroup of the general linear group have been investigated recently. In this paper, we consider the normalizer of a Singer subgroup of the general linear group a...

Translation, cultural adaptation and validation for Brazilian Portuguese of the Cardiff Acne Disability Index instrument*

BACKGROUND: The Cardiff Acne Disability Index was originally developed in English for measuring quality of life of acne patients. Considering the psychosocial impact of this disease, it is important to have instruments culturally and linguistically validated for use in Brazilian adolescents. OBJECTIVE: To translate the Cardiff Acne Disability Index into Brazilian Portuguese, culturally adapt it...

Publication date: 2015